Gett, previously known as GetTaxi, is a technology platform solely focused on corporate Ground Transportation Management (GTM). Clients use its application to order taxis, and drivers accept those orders (offers). When a client clicks the Order button in the application, the matching system searches for the most relevant drivers and offers them the order. In this task, we investigate some matching metrics for orders that did not complete successfully, i.e., the customer did not end up getting a car.
This data was taken from the StrataScratch platform.
Data Description

We have two data sets, data_orders and data_offers, both stored in CSV format. The data_orders data set contains the following columns:

- order_datetime - time of the order
- origin_longitude - longitude of the order
- origin_latitude - latitude of the order
- m_order_eta - time before order arrival
- order_gk - order number
- order_status_key - status, an enumeration with the following mapping: 4 - cancelled by client; 9 - cancelled by system, i.e., a reject
- is_driver_assigned_key - whether a driver has been assigned
- cancellations_time_in_seconds - how many seconds passed before cancellation

The data_offers data set is a simple map with two columns:

- order_gk - order number, associated with the same column from the orders data set
- offer_id - ID of an offer
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
import h3
pio.renderers.default='notebook'
data_orders = pd.read_csv('data_orders.csv')
data_orders.head()
|   | order_datetime | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds |
|---|---|---|---|---|---|---|---|---|
| 0 | 18:08:07 | -0.978916 | 51.456173 | 60.0 | 3000583041974 | 4 | 1 | 198.0 |
| 1 | 20:57:32 | -0.950385 | 51.456843 | NaN | 3000583116437 | 4 | 0 | 128.0 |
| 2 | 12:07:50 | -0.969520 | 51.455544 | 477.0 | 3000582891479 | 4 | 1 | 46.0 |
| 3 | 13:50:20 | -1.054671 | 51.460544 | 658.0 | 3000582941169 | 4 | 1 | 62.0 |
| 4 | 21:24:45 | -0.967605 | 51.458236 | NaN | 3000583140877 | 9 | 0 | NaN |
data_offers = pd.read_csv('data_offers.csv')
data_offers.head()
|   | order_gk | offer_id |
|---|---|---|
| 0 | 3000579625629 | 300050936206 |
| 1 | 3000627306450 | 300052064651 |
| 2 | 3000632920686 | 300052408812 |
| 3 | 3000632771725 | 300052393030 |
| 4 | 3000583467642 | 300051001196 |
all_offers = data_offers[data_offers['order_gk']==3000583116437]
all_offers
|   | order_gk | offer_id |
|---|---|---|
| 3302 | 3000583116437 | 300050986179 |
| 142592 | 3000583116437 | 300050986174 |
| 315434 | 3000583116437 | 300050986180 |
Info
data_offers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334363 entries, 0 to 334362
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   order_gk  334363 non-null  int64
 1   offer_id  334363 non-null  int64
dtypes: int64(2)
memory usage: 5.1 MB
Cleaning Data: Empty Cells
data_offers[data_offers.isnull().any(axis=1)].count()
order_gk    0
offer_id    0
dtype: int64
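An equivalent, slightly more direct check counts missing values per column with `isnull().sum()`; a minimal sketch on a hypothetical toy frame (not the real data sets):

```python
import pandas as pd

# Toy frame with one missing value in each column
df = pd.DataFrame({"order_gk": [1, 2, None], "offer_id": [10.0, None, 30.0]})

# Missing values per column
nulls = df.isnull().sum()
print(nulls.tolist())  # → [1, 1]
```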
Build up the distribution of orders by reason for failure: cancellations before and after driver assignment, and order rejections. Analyse the resulting plot. Which category has the highest number of orders?
# data_orders:
# add an offer_count column (number of rows in data_offers per order_gk)
df_count = data_offers.groupby(['order_gk']).count().reset_index()
df_count.rename(columns={'offer_id': 'offer_count'}, inplace=True)
merged = pd.concat(
objs=(iDF.set_index('order_gk') for iDF in (data_orders, df_count)),
axis=1,
join='inner'
).reset_index()
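The `pd.concat(..., join='inner')` above is equivalent to an inner merge on `order_gk`; a minimal sketch with hypothetical toy frames:

```python
import pandas as pd

orders = pd.DataFrame({"order_gk": [1, 2, 3], "order_status_key": [4, 9, 4]})
counts = pd.DataFrame({"order_gk": [1, 3], "offer_count": [2, 1]})

# Inner merge keeps only orders that received at least one offer
merged_alt = orders.merge(counts, on="order_gk", how="inner")
print(merged_alt["offer_count"].tolist())  # → [2, 1]
```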
# add a Category column derived from order_status_key and is_driver_assigned_key (4 categories)
conditions = [(merged['order_status_key'] == 4) & (merged['is_driver_assigned_key'] == 1),
(merged['order_status_key'] == 4) & (merged['is_driver_assigned_key'] == 0),
(merged['order_status_key'] == 9) & (merged['is_driver_assigned_key'] == 1),
(merged['order_status_key'] == 9) & (merged['is_driver_assigned_key'] == 0)]
choices = ['by client, after driver assignment', 'by client, before driver assignment', 'by system, after driver assignment', 'by system, before driver assignment']
merged['Category'] = np.select(conditions, choices, default='other')
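`np.select` evaluates the conditions in order and falls back to `default` for rows that match none of them; a minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"order_status_key": [4, 4, 9], "is_driver_assigned_key": [1, 0, 0]})
conditions = [
    (toy["order_status_key"] == 4) & (toy["is_driver_assigned_key"] == 1),
    (toy["order_status_key"] == 4) & (toy["is_driver_assigned_key"] == 0),
]
choices = ["client, after", "client, before"]
# The third row matches neither condition, so it gets the default label
toy["Category"] = np.select(conditions, choices, default="other")
print(toy["Category"].tolist())  # → ['client, after', 'client, before', 'other']
```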
# Plot the distribution - bar chart with one column per category
groupby_category = merged.groupby(['Category'])['order_gk'].count().reset_index()
groupby_category.rename(columns={'order_gk': 'cancellations'}, inplace=True)
x_labels=[]
for label in groupby_category['Category']:
x_labels.append(label.replace(', ','<br>'))
fig_cancel_category = px.bar(groupby_category, x='Category', y='cancellations',
color_discrete_sequence=['lightseagreen'], text_auto=True)
fig_cancel_category.update_layout(title_text="Cancellations per Category", title_x=0.5, xaxis={'categoryorder':'total descending'})
fig_cancel_category.update_xaxes(tickvals=groupby_category['Category'], ticktext=x_labels)
fig_cancel_category.update_xaxes(type='category')
fig_cancel_category.show()
Plot the distribution of failed orders by hour. Is there a trend where certain hours have an abnormally high proportion of one category or another? Which hours have the most failures? How can this be explained?
# data_orders: group by hour(order_datetime), with cancellation counts per Category as columns
merged['order_datetime'] = pd.to_datetime(merged['order_datetime'], format='%H:%M:%S')
merged['hour'] = merged['order_datetime'].dt.hour
groupby_hour = merged.groupby(['hour','Category'])['order_gk'].count().reset_index()
groupby_hour.rename(columns={'order_gk': 'cancellations'}, inplace=True)
# Group the data: hour as rows, categories as columns, cancellation counts as values
stacked_df = groupby_hour.pivot_table(values='cancellations',index='hour',columns='Category').fillna(0)
# Plot the distribution - stacked bar - df2.plot.barh(stacked=True)
# stacked_df.plot.bar(stacked=True, rot=0)
fig_cancel_hour = px.bar(groupby_hour, x='hour', y='cancellations', color='Category',
color_discrete_sequence=['lightsalmon', 'lightseagreen',
'lightskyblue', 'lightslategray'])
fig_cancel_hour.update_layout(title_text="Cancellations per hour", title_x=0.3)
fig_cancel_hour.update_xaxes(type='category')
fig_cancel_hour.show()
Plot the average time to cancellation with and without a driver, by the hour. If there are any outliers in the data, it would be better to remove them. Can we draw any conclusions from this plot?
# data_orders: find outliers in cancellations_time_in_seconds
# Split into two data frames: with / without an assigned driver
# Keep only hour(order_datetime) and 'cancellations_time_in_seconds'
df_with_driver = merged[merged['is_driver_assigned_key'] == 1]
df_with_driver_cancel = df_with_driver[['hour', 'cancellations_time_in_seconds']]
df_without_driver = merged[merged['is_driver_assigned_key'] == 0]
df_without_driver_cancel = df_without_driver[['hour', 'cancellations_time_in_seconds']]
# Reshape for plotting
df_with_driver_pivot = df_with_driver_cancel.pivot_table(values='cancellations_time_in_seconds', index=df_with_driver_cancel.index, columns='hour', aggfunc='first')
df_without_driver_pivot = df_without_driver_cancel.pivot_table(values='cancellations_time_in_seconds', index=df_without_driver_cancel.index, columns='hour', aggfunc='first')
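Pivoting on the row index with `aggfunc='first'` keeps every observation while spreading values into one column per hour (NaN elsewhere); a minimal sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({"hour": [0, 0, 1], "seconds": [10.0, 20.0, 30.0]})
# Each original row lands in its own pivot row, in the column for its hour
wide = toy.pivot_table(values="seconds", index=toy.index, columns="hour", aggfunc="first")
print(wide[0].dropna().tolist(), wide[1].dropna().tolist())  # → [10.0, 20.0] [30.0]
```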
df_without_driver_pivot.describe()
| hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 218.000000 | 155.000000 | 164.000000 | 178.000000 | 33.000000 | 13.000000 | 48.000000 | 131.000000 | 336.000000 | 119.000000 | ... | 81.000000 | 148.000000 | 94.000000 | 162.000000 | 108.000000 | 67.000000 | 163.000000 | 365.000000 | 231.000000 | 267.000000 |
| mean | 95.756881 | 91.283871 | 112.335366 | 110.050562 | 81.636364 | 113.230769 | 139.270833 | 117.977099 | 111.446429 | 117.025210 | ... | 85.901235 | 106.297297 | 96.053191 | 88.512346 | 68.074074 | 72.552239 | 98.239264 | 116.526027 | 88.359307 | 97.707865 |
| std | 89.814969 | 69.392705 | 88.731917 | 83.632994 | 77.545075 | 74.605802 | 189.925040 | 89.517292 | 93.357075 | 107.378742 | ... | 102.675777 | 118.710902 | 140.975508 | 90.154542 | 61.304923 | 91.126001 | 78.511762 | 89.357828 | 78.830340 | 89.700194 |
| min | 5.000000 | 5.000000 | 4.000000 | 6.000000 | 5.000000 | 9.000000 | 5.000000 | 5.000000 | 3.000000 | 6.000000 | ... | 4.000000 | 4.000000 | 7.000000 | 4.000000 | 7.000000 | 6.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 |
| 25% | 28.250000 | 36.000000 | 49.750000 | 40.250000 | 12.000000 | 68.000000 | 34.000000 | 50.000000 | 47.750000 | 43.500000 | ... | 14.000000 | 31.750000 | 23.500000 | 31.500000 | 20.000000 | 13.000000 | 34.000000 | 51.000000 | 28.000000 | 30.000000 |
| 50% | 72.500000 | 77.000000 | 95.500000 | 89.500000 | 53.000000 | 109.000000 | 102.000000 | 95.000000 | 91.000000 | 89.000000 | ... | 65.000000 | 82.000000 | 59.000000 | 70.500000 | 44.000000 | 50.000000 | 82.000000 | 106.000000 | 69.000000 | 77.000000 |
| 75% | 141.500000 | 122.000000 | 168.000000 | 165.000000 | 148.000000 | 177.000000 | 180.250000 | 183.500000 | 166.750000 | 169.500000 | ... | 120.000000 | 148.500000 | 117.750000 | 124.000000 | 96.750000 | 91.500000 | 153.500000 | 162.000000 | 122.500000 | 151.000000 |
| max | 833.000000 | 343.000000 | 640.000000 | 592.000000 | 234.000000 | 250.000000 | 1179.000000 | 467.000000 | 725.000000 | 735.000000 | ... | 782.000000 | 930.000000 | 1035.000000 | 862.000000 | 243.000000 | 557.000000 | 368.000000 | 831.000000 | 625.000000 | 723.000000 |
8 rows × 24 columns
# Function to remove outliers using the 1.5 * IQR rule, column by column
def removeOutliers(df):
    threshold = 1.5
    for (columnName, columnData) in df.items():
        Q1 = columnData.quantile(0.25)
        Q3 = columnData.quantile(0.75)
        IQR = Q3 - Q1
        # drop rows whose value in this column falls outside the IQR fences
        outliers = df[(df[columnName] < Q1 - threshold * IQR) | (df[columnName] > Q3 + threshold * IQR)]
        df = df.drop(outliers.index)
    return df
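The 1.5 × IQR rule used above can be checked on a single series; a minimal sketch with hypothetical numbers, where only the extreme value falls outside the fences:

```python
import pandas as pd

def iqr_filter(s, threshold=1.5):
    # Keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - threshold * iqr) & (s <= q3 + threshold * iqr)]

s = pd.Series([10, 12, 11, 13, 12, 500])
print(iqr_filter(s).tolist())  # → [10, 12, 11, 13, 12]
```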
# Remove outlier
df_with_driver_clean = removeOutliers(df_with_driver_pivot)
df_without_driver_clean = removeOutliers(df_without_driver_pivot)
df_with_driver_clean.describe()
| hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 84.000000 | 72.000000 | 70.000000 | 57.000000 | 30.000000 | 20.000000 | 42.000000 | 138.000000 | 235.000000 | 130.000000 | ... | 77.000000 | 114.000000 | 95.000000 | 110.000000 | 97.000000 | 88.000000 | 89.000000 | 81.000000 | 102.000000 | 110.000000 |
| mean | 202.488095 | 195.166667 | 290.871429 | 237.298246 | 149.033333 | 149.700000 | 135.833333 | 149.543478 | 113.446809 | 148.015385 | ... | 141.974026 | 139.438596 | 138.568421 | 147.281818 | 177.701031 | 140.147727 | 151.685393 | 174.148148 | 177.117647 | 207.963636 |
| std | 174.277683 | 194.603751 | 292.311226 | 254.335476 | 159.291423 | 160.023715 | 149.869686 | 135.180988 | 86.798423 | 143.566746 | ... | 134.062728 | 121.472458 | 144.318181 | 153.584817 | 186.717097 | 127.893323 | 152.862505 | 171.203980 | 178.329725 | 211.009019 |
| min | 14.000000 | 9.000000 | 16.000000 | 11.000000 | 10.000000 | 11.000000 | 14.000000 | 6.000000 | 13.000000 | 12.000000 | ... | 17.000000 | 9.000000 | 8.000000 | 15.000000 | 13.000000 | 17.000000 | 13.000000 | 7.000000 | 11.000000 | 10.000000 |
| 25% | 48.750000 | 45.000000 | 52.000000 | 50.000000 | 33.250000 | 26.750000 | 38.000000 | 45.250000 | 45.000000 | 45.000000 | ... | 45.000000 | 49.000000 | 43.500000 | 42.750000 | 53.000000 | 46.000000 | 43.000000 | 53.000000 | 41.500000 | 47.250000 |
| 50% | 157.000000 | 132.000000 | 180.000000 | 98.000000 | 77.500000 | 86.500000 | 72.000000 | 108.500000 | 90.000000 | 87.000000 | ... | 70.000000 | 100.000000 | 73.000000 | 79.000000 | 79.000000 | 84.000000 | 80.000000 | 99.000000 | 99.500000 | 142.500000 |
| 75% | 295.000000 | 268.000000 | 478.250000 | 394.000000 | 212.250000 | 228.500000 | 158.000000 | 219.500000 | 158.500000 | 199.250000 | ... | 196.000000 | 184.750000 | 186.000000 | 214.750000 | 226.000000 | 194.000000 | 222.000000 | 281.000000 | 254.500000 | 336.000000 |
| max | 679.000000 | 753.000000 | 1146.000000 | 830.000000 | 630.000000 | 526.000000 | 576.000000 | 570.000000 | 390.000000 | 622.000000 | ... | 543.000000 | 518.000000 | 555.000000 | 624.000000 | 761.000000 | 487.000000 | 581.000000 | 689.000000 | 646.000000 | 833.000000 |
8 rows × 24 columns
# Average time for cancellations
df_with_driver_mean = df_with_driver_clean.mean().to_frame().reset_index()
df_with_driver_mean.rename(columns={0: 'mean_cancellations_time'}, inplace=True)
df_without_driver_mean = df_without_driver_clean.mean().to_frame().reset_index()
df_without_driver_mean.rename(columns={0: 'mean_cancellations_time'}, inplace=True)
# Barchart with_driver
fig_mean_cancel_time = go.Figure()
fig_mean_cancel_time.add_bar(x=df_with_driver_mean['hour'], y=df_with_driver_mean['mean_cancellations_time'],
name="with driver", marker_color='lightseagreen')
fig_mean_cancel_time.add_bar(x=df_without_driver_mean['hour'], y=df_without_driver_mean['mean_cancellations_time'],
name="without driver", marker_color='lightslategray')
fig_mean_cancel_time.update_layout(title_text="Mean cancellations time per hour", title_x=0.5)
fig_mean_cancel_time.update_xaxes(type='category')
fig_mean_cancel_time.show()
Plot the distribution of average ETA by hours. How can this plot be explained?
# data_orders: group by hour(order_datetime), value = ['m_order_eta'].mean()
# Reshape for plotting
df_with_driver_eta = df_with_driver[['hour', 'm_order_eta']]
df_with_driver_eta_pivot = df_with_driver_eta.pivot_table(values='m_order_eta', index=df_with_driver.index, columns='hour', aggfunc='first')
# Remove Outliers
df_with_driver_eta_clean = removeOutliers(df_with_driver_eta_pivot)
# Average time for ETA
df_with_driver_eta_mean = df_with_driver_eta_clean.mean().to_frame().reset_index()
df_with_driver_eta_mean.rename(columns={0: 'mean_eta'}, inplace=True)
# Barchart with_driver
fig_mean_eta = px.bar(df_with_driver_eta_mean, x='hour', y='mean_eta',
color_discrete_sequence=['lightseagreen'])
fig_mean_eta.update_layout(title_text="Mean eta", title_x=0.5)
fig_mean_eta.update_xaxes(type='category')
fig_mean_eta.show()
df_scatter = data_orders
fig = px.scatter(df_scatter, x="origin_longitude", y="origin_latitude", color='m_order_eta')
fig.show()
df_scatter2 = merged
fig = px.scatter(df_scatter2, x="origin_longitude", y="origin_latitude", color='Category')
fig.show()